OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

When Was It Written? Automatically Determining Publication Dates

Internal identifier: 000403 (Main/Exploration); previous: 000402; next: 000404

Authors: Anne Garcia-Fernandez [France]; Anne-Laure Ligozat [France]; Marco Dinarelli [France]; Delphine Bernhard [France]

RBID : ISTEX:1A59EF625F8BD73E9C9DA7E5A6068BBA5B49114B

Abstract

Automatically determining the publication date of a document is a complex task, since a document may contain only a few intra-textual hints about its publication date. Yet, it has many important applications. Indeed, the amount of digitized historical documents is constantly increasing, but their publication dates are not always properly identified via OCR acquisition. Accurate knowledge about publication dates is crucial for many applications, e.g. studying the evolution of document topics over a certain period of time. In this article, we present a method for automatically determining the publication dates of documents, which was evaluated on a French newspaper corpus in the context of the DEFT 2011 evaluation campaign. Our system is based on a combination of different individual systems, relying on both supervised and unsupervised learning, and uses several external resources, e.g. Wikipedia, Google Books Ngrams, and etymological background knowledge about the French language. Our system detects the correct year of publication in 10% of the cases for 300-word excerpts and in 14% of the cases for 500-word excerpts, which is very promising given the complexity of the task.

DOI: 10.1007/978-3-642-24583-1_22


The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">When Was It Written? Automatically Determining Publication Dates</title>
<author>
<name sortKey="Garcia Fernandez, Anne" sort="Garcia Fernandez, Anne" uniqKey="Garcia Fernandez A" first="Anne" last="Garcia-Fernandez">Anne Garcia-Fernandez</name>
</author>
<author>
<name sortKey="Ligozat, Anne Laure" sort="Ligozat, Anne Laure" uniqKey="Ligozat A" first="Anne-Laure" last="Ligozat">Anne-Laure Ligozat</name>
</author>
<author>
<name sortKey="Dinarelli, Marco" sort="Dinarelli, Marco" uniqKey="Dinarelli M" first="Marco" last="Dinarelli">Marco Dinarelli</name>
</author>
<author>
<name sortKey="Bernhard, Delphine" sort="Bernhard, Delphine" uniqKey="Bernhard D" first="Delphine" last="Bernhard">Delphine Bernhard</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:1A59EF625F8BD73E9C9DA7E5A6068BBA5B49114B</idno>
<date when="2011" year="2011">2011</date>
<idno type="doi">10.1007/978-3-642-24583-1_22</idno>
<idno type="url">https://api.istex.fr/document/1A59EF625F8BD73E9C9DA7E5A6068BBA5B49114B/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000105</idno>
<idno type="wicri:Area/Istex/Curation">000103</idno>
<idno type="wicri:Area/Istex/Checkpoint">000059</idno>
<idno type="wicri:doubleKey">0302-9743:2011:Garcia Fernandez A:when:was:it</idno>
<idno type="wicri:Area/Main/Merge">000408</idno>
<idno type="wicri:Area/Main/Curation">000403</idno>
<idno type="wicri:Area/Main/Exploration">000403</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">When Was It Written? Automatically Determining Publication Dates</title>
<author>
<name sortKey="Garcia Fernandez, Anne" sort="Garcia Fernandez, Anne" uniqKey="Garcia Fernandez A" first="Anne" last="Garcia-Fernandez">Anne Garcia-Fernandez</name>
<affiliation wicri:level="1">
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName>
<region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Ligozat, Anne Laure" sort="Ligozat, Anne Laure" uniqKey="Ligozat A" first="Anne-Laure" last="Ligozat">Anne-Laure Ligozat</name>
<affiliation wicri:level="1">
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName>
<region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country xml:lang="fr">France</country>
<wicri:regionArea>ENSIIE, Evry</wicri:regionArea>
<placeName>
<region type="région">Île-de-France</region>
<settlement type="city">Évry (Essonne)</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Dinarelli, Marco" sort="Dinarelli, Marco" uniqKey="Dinarelli M" first="Marco" last="Dinarelli">Marco Dinarelli</name>
<affiliation wicri:level="1">
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName>
<region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Bernhard, Delphine" sort="Bernhard, Delphine" uniqKey="Bernhard D" first="Delphine" last="Bernhard">Delphine Bernhard</name>
<affiliation wicri:level="1">
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName>
<region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2011</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">1A59EF625F8BD73E9C9DA7E5A6068BBA5B49114B</idno>
<idno type="DOI">10.1007/978-3-642-24583-1_22</idno>
<idno type="ChapterID">22</idno>
<idno type="ChapterID">Chap22</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Automatically determining the publication date of a document is a complex task, since a document may contain only a few intra-textual hints about its publication date. Yet, it has many important applications. Indeed, the amount of digitized historical documents is constantly increasing, but their publication dates are not always properly identified via OCR acquisition. Accurate knowledge about publication dates is crucial for many applications, e.g. studying the evolution of document topics over a certain period of time. In this article, we present a method for automatically determining the publication dates of documents, which was evaluated on a French newspaper corpus in the context of the DEFT 2011 evaluation campaign. Our system is based on a combination of different individual systems, relying on both supervised and unsupervised learning, and uses several external resources, e.g. Wikipedia, Google Books Ngrams, and etymological background knowledge about the French language. Our system detects the correct year of publication in 10% of the cases for 300-word excerpts and in 14% of the cases for 500-word excerpts, which is very promising given the complexity of the task.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Île-de-France</li>
</region>
<settlement>
<li>Orsay</li>
<li>Évry (Essonne)</li>
</settlement>
</list>
<tree>
<country name="France">
<region name="Île-de-France">
<name sortKey="Garcia Fernandez, Anne" sort="Garcia Fernandez, Anne" uniqKey="Garcia Fernandez A" first="Anne" last="Garcia-Fernandez">Anne Garcia-Fernandez</name>
</region>
<name sortKey="Bernhard, Delphine" sort="Bernhard, Delphine" uniqKey="Bernhard D" first="Delphine" last="Bernhard">Delphine Bernhard</name>
<name sortKey="Dinarelli, Marco" sort="Dinarelli, Marco" uniqKey="Dinarelli M" first="Marco" last="Dinarelli">Marco Dinarelli</name>
<name sortKey="Garcia Fernandez, Anne" sort="Garcia Fernandez, Anne" uniqKey="Garcia Fernandez A" first="Anne" last="Garcia-Fernandez">Anne Garcia-Fernandez</name>
<name sortKey="Ligozat, Anne Laure" sort="Ligozat, Anne Laure" uniqKey="Ligozat A" first="Anne-Laure" last="Ligozat">Anne-Laure Ligozat</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000403 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000403 | SxmlIndent | more
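
Once a record has been extracted, for example with one of the commands above, its fields can also be read outside Dilib. The following is a minimal Python sketch, not part of Dilib, using only the standard library. It works on an abbreviated copy of the record shown on this page. Note that the record uses the `wicri:` prefix without declaring it, so a strict XML parser would reject it; the sketch binds the prefix to a placeholder URI (`urn:example:wicri`, an assumption made for illustration, not the real Wicri namespace). The function names `parse_record` and `summarize` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the record above; only a few elements are kept.
RECORD = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">When Was It Written? Automatically Determining Publication Dates</title>
<author>
<name first="Anne" last="Garcia-Fernandez">Anne Garcia-Fernandez</name>
</author>
<author>
<name first="Anne-Laure" last="Ligozat">Anne-Laure Ligozat</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="doi">10.1007/978-3-642-24583-1_22</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

# Placeholder namespace URI (an assumption): the record never declares
# the wicri: prefix, so we bind it ourselves before parsing.
PLACEHOLDER_NS = "urn:example:wicri"


def parse_record(text):
    """Declare the wicri: prefix on the root element, then parse."""
    declared = text.replace(
        "<record>", f'<record xmlns:wicri="{PLACEHOLDER_NS}">', 1
    )
    return ET.fromstring(declared)


def summarize(root):
    """Collect the title, author names and DOI from a parsed record."""
    stmt = root.find("./TEI/teiHeader/fileDesc/titleStmt")
    title = stmt.findtext("title")
    authors = [name.text for name in stmt.findall("author/name")]
    doi = root.findtext(".//publicationStmt/idno[@type='doi']")
    return {"title": title, "authors": authors, "doi": doi}


summary = summarize(parse_record(RECORD))
print(summary)
```

The same approach should apply to the full record, provided every undeclared prefix it uses is bound in the same way.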

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:1A59EF625F8BD73E9C9DA7E5A6068BBA5B49114B
   |texte=   When Was It Written? Automatically Determining Publication Dates
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024